
 random forest


Overfitting or perfect fitting? Risk bounds for classification and regression rules that interpolate

Neural Information Processing Systems

Many modern machine learning models are trained to achieve zero or near-zero training error in order to obtain near-optimal (but non-zero) test error. This phenomenon of strong generalization performance for "overfitted" / interpolated classifiers appears to be ubiquitous in high-dimensional data, having been observed in deep networks, kernel machines, boosting and random forests. Their performance is consistently robust even when the data contain large amounts of label noise. Very little theory is available to explain these observations. The vast majority of theoretical analyses of generalization allow for interpolation only when there is little or no label noise. This paper takes a step toward a theoretical foundation for interpolated classifiers by analyzing local interpolating schemes, including a geometric simplicial interpolation algorithm and singularly weighted $k$-nearest neighbor schemes. Consistency or near-consistency is proved for these schemes in classification and regression problems.
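
To make the idea of a singularly weighted $k$-nearest neighbor scheme concrete, here is a minimal sketch for regression. It assumes inverse-power weights $w_i = \|x - x_i\|^{-\alpha}$, which blow up as the query approaches a training point and so force the predictor to reproduce the training labels exactly. The function name `singular_knn_predict` and the parameters `k` and `alpha` are hypothetical choices for illustration, not the paper's exact construction.

```python
import math


def singular_knn_predict(train_x, train_y, query, k=3, alpha=2.0):
    """Weighted k-NN regression with singular weights dist ** (-alpha).

    Illustrative sketch only: the weight form and alpha are assumptions.
    """
    # Distance from the query to every training point.
    dists = [math.dist(x, query) for x in train_x]
    # Indices of the k nearest training points.
    nearest = sorted(range(len(train_x)), key=lambda i: dists[i])[:k]
    # If the query coincides with a training point, the weight there is
    # infinite, so the prediction is that point's label (interpolation).
    for i in nearest:
        if dists[i] == 0.0:
            return train_y[i]
    weights = [dists[i] ** (-alpha) for i in nearest]
    total = sum(weights)
    return sum(w * train_y[i] for w, i in zip(weights, nearest)) / total


# Usage: the fit passes through every training label, noisy or not.
train_x = [(0.0,), (1.0,), (2.0,), (3.0,)]
train_y = [0.0, 5.0, -1.0, 2.0]
fitted = [singular_knn_predict(train_x, train_y, x) for x in train_x]
```

Away from the training points the prediction is an ordinary weighted average of the $k$ nearest labels, so the effect of a single noisy label stays localized near that point.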


When do random forests fail?

Neural Information Processing Systems

Random forests are learning algorithms that build large collections of random trees and make predictions by averaging the individual tree predictions. In this paper, we consider various tree constructions and examine how the choice of parameters affects the generalization error of the resulting random forests as the sample size goes to infinity. We show that subsampling of data points during the tree construction phase is important: Forests can become inconsistent with either no subsampling or too severe subsampling. As a consequence, even highly randomized trees can lead to inconsistent forests if no subsampling is used, which implies that some of the commonly used setups for random forests can be inconsistent. As a second consequence we can show that trees that have good performance in nearest-neighbor search can be a poor choice for random forests.
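
The two ingredients discussed above, subsampling of data points and averaging of individual tree predictions, can be sketched as follows for 1-D regression. This is a toy illustration under stated assumptions: split points are drawn uniformly at random inside each cell, subsamples are drawn without replacement, and the names `build_tree`, `tree_predict`, `forest_predict` and the parameter `subsample` are hypothetical. It is not one of the tree constructions analyzed in the paper.

```python
import random


def build_tree(points, depth, rng):
    """Grow a randomized regression tree on a list of (x, y) pairs."""
    xs = [x for x, _ in points]
    mean_y = sum(y for _, y in points) / len(points)
    if depth == 0 or len(points) < 2 or min(xs) == max(xs):
        return mean_y  # leaf: mean label of the cell
    # Assumption: the split position is uniform over the cell's x-range.
    split = rng.uniform(min(xs), max(xs))
    left = [p for p in points if p[0] <= split]
    right = [p for p in points if p[0] > split]
    if not left or not right:
        return mean_y
    return (split,
            build_tree(left, depth - 1, rng),
            build_tree(right, depth - 1, rng))


def tree_predict(tree, x):
    """Descend from the root until a leaf (a float) is reached."""
    while isinstance(tree, tuple):
        split, left, right = tree
        tree = left if x <= split else right
    return tree


def forest_predict(data, x, n_trees=50, subsample=0.5, depth=4, seed=0):
    """Average the predictions of trees grown on random subsamples."""
    rng = random.Random(seed)
    m = max(1, int(subsample * len(data)))
    preds = []
    for _ in range(n_trees):
        sample = rng.sample(data, m)  # subsampling without replacement
        preds.append(tree_predict(build_tree(sample, depth, rng), x))
    return sum(preds) / len(preds)


# Usage: subsample=1.0 corresponds to no subsampling; a small value
# corresponds to severe subsampling.
data = [(i / 10, float(i % 2)) for i in range(20)]
prediction = forest_predict(data, 0.55, subsample=0.5, seed=1)
```

Varying `subsample` between these extremes is the knob whose effect on consistency the abstract describes.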


Random Forests as Statistical Procedures: Design, Variance, and Dependence

O'Connell, Nathaniel S.

arXiv.org Machine Learning

We develop a finite-sample, design-based theory for random forests in which each tree is a randomized conditional predictor acting on fixed covariates and the forest is their Monte Carlo average. An exact variance identity separates Monte Carlo error from a covariance floor that persists under infinite aggregation. The floor arises through two mechanisms: observation reuse, where the same training outcomes receive weight across multiple trees, and partition alignment, where independently generated trees discover similar conditional prediction rules. We prove the floor is strictly positive under minimal conditions and show that alignment persists even when sample splitting eliminates observation overlap entirely. We introduce procedure-aligned synthetic resampling (PASR) to estimate the covariance floor, decomposing the total prediction uncertainty of a deployed forest into interpretable components. For continuous outcomes, resulting prediction intervals achieve nominal coverage with a theoretically guaranteed conservative bias direction. For classification forests, the PASR estimator is asymptotically unbiased, providing the first pointwise confidence intervals for predicted conditional probabilities from a deployed forest. Nominal coverage is maintained across a range of design configurations for both outcome types, including high-dimensional settings. The underlying theory extends to any tree-based ensemble with an exchangeable tree-generating mechanism.
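
As a hedged illustration of how such a variance identity can separate Monte Carlo error from a floor, consider the generic case of $B$ exchangeable tree predictions $T_1, \dots, T_B$ at a fixed point, with common variance $\sigma^2$ and common pairwise covariance $c$. The symbols $T_b$, $\sigma^2$, and $c$ are assumptions introduced here; the paper's exact identity may be stated differently.

```latex
\operatorname{Var}\!\left(\frac{1}{B}\sum_{b=1}^{B} T_b\right)
  = \frac{1}{B^2}\left[ B\,\sigma^2 + B(B-1)\,c \right]
  = \underbrace{c}_{\text{covariance floor}}
  + \underbrace{\frac{\sigma^2 - c}{B}}_{\text{Monte Carlo error}}
```

The second term vanishes as $B \to \infty$, while the first does not, which is the sense in which a covariance floor persists under infinite aggregation.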






Supplementary Document

Neural Information Processing Systems

The pseudo-code for plugging our method into vanilla BO is summarised in Algorithm 1. Therefore, our method is applicable to any other variant of BO in a plug-in manner. In this section, we present the proofs associated with the theoretical assertions from Section 2. To … Lemma 1. Assume the GP employs a stationary kernel … Lemma 2. Given Lemma 1, determining … Proposition 2. Leveraging Lemma 2, suppose … Lemma 3. As per Srinivas et al., the optimization process in BO can be conceptualized as a sampling … $\Pr\left[\,|f(x) - \mu(x)| \le \omega\sigma(x)\,\right] > \delta$ (24), where $\delta > 0$ signifies the confidence level adhered to by the UCB. This lemma is taken directly from Srinivas et al.; the proof can be found therein. Theorem 1. Leveraging Corollary 1, when employing the termination method proposed in this paper, … As discussed in Remark 2 of Section 2.2 in the main manuscript, we suggest initializing L-BFGS … Different subplots are (a) our proposed method, (b) Naïve method, (c) Nguyen's method, (d) Lorenz's …